In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

Group: Helena Sokolovska and Marvel Hariadi¶

CPSC 330 - Applied Machine Learning¶

Homework 5: Putting it all together¶

Associated lectures: All material till lecture 13¶

Due date: Monday, Mar 10, 11:59 pm

Table of contents¶

  1. Submission instructions
  2. Understanding the problem
  3. Data splitting
  4. EDA
  5. Feature engineering
  6. Preprocessing and transformations
  7. Baseline model
  8. Linear models
  9. Different models
  10. Feature selection
  11. Hyperparameter optimization
  12. Interpretation and feature importances
  13. Results on the test set
  14. Summary of the results
  15. Your takeaway from the course

Submission instructions¶


rubric={points:4}

You may work with a partner on this homework and submit your assignment as a group. Below are some instructions on working as a group.

  • The maximum group size is 2.
  • Use group work as an opportunity to collaborate and learn new things from each other.
  • Be respectful to each other and make sure you understand all the concepts in the assignment well.
  • It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline.
  • You can find the instructions on how to do group submission on Gradescope here.
  • If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.

Follow the homework submission instructions.

  1. Before submitting the assignment, run all cells in your notebook to make sure there are no errors by doing Kernel -> Restart Kernel and Clear All Outputs and then Run -> Run All Cells.
  2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
  3. Follow the CPSC 330 homework instructions, which include information on how to do your assignment and how to submit your assignment.
  4. Upload your solution on Gradescope. Check out this Gradescope Student Guide if you need help with Gradescope submission.
  5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope.

Note: The assignments will get gradually more open-ended as we progress through the course. In many cases, there won't be a single correct solution. Sometimes you will have to make your own choices and your own decisions (for example, on what parameter values to use when they are not explicitly provided in the instructions). Use your own judgment in such cases and justify your choices, if necessary.

A tuned decision tree (max_depth=3) appears to be the best model for predicting whether clients will default (accuracy: 0.8043), but it is not much better than the dummy model (CV accuracy: 0.808 +/- 0.003 vs. 0.777 for the dummy). Its limited value is further evidenced by the poor precision (0.5994), recall (0.2895), F1 (0.3904), and average precision (0.45) scores.

Imports¶

Imports

Points: 0

In [2]:
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.inspection import permutation_importance
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    cross_val_predict,
    train_test_split,
)

import shap 

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SequentialFeatureSelector

Introduction ¶

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project:

Tips¶

  1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary.
  2. Do not include everything you ever tried in your submission -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code.
  3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution.

Assessment¶

We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results. For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.

A final note¶

Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project. Of course, if you're having fun, you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well.



1. Pick your problem and explain the prediction problem ¶


rubric={points:3}

In this mini project, you have the option to choose which dataset you will work on. The tasks you will need to carry out will be similar, independent of your choice.

Option 1¶

You can choose to work on a classification problem of predicting whether a credit card client will default or not. For this problem, you will use the Default of Credit Card Clients Dataset. In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default on (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas from, and compare your results with, the associated research paper, which is available through the UBC library.

Option 2¶

You can choose to work on a regression problem using a dataset of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset; then you will try to predict reviews_per_month as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality, they might instead use something like vacancy rate or average rating as their target, but we do not have those available here.

Note there is an updated version of this dataset with more features available here. The features we are using are in listings.csv.gz for the New York City dataset. You will also see some other files, like reviews.csv.gz. For your own interest, you may want to explore the expanded dataset and try your analysis there. However, please submit your results on the dataset obtained from Kaggle.

Your tasks:

  1. Spend some time understanding the options and pick the one you find more interesting (it may help to spend some time looking at the documentation available on Kaggle for each dataset).
  2. After making your choice, focus on understanding the problem and what each feature means, again using the documentation on the dataset page on Kaggle. Write a few sentences on your initial thoughts on the problem and the dataset.
  3. Download the dataset and read it as a pandas dataframe.

Solution_1

Points: 3

We went with option 1: predicting whether or not credit card clients will default on their bills.

We think there are also a lot of unique traits in the data that come from the cultural context of Taiwan. For example, the possible values for marital status are married, single, or others; engagement, dating, and common-law partnerships are not tracked. All of this information would be tracked in Canada, but in this dataset those statuses are grouped together under the "others" bucket. We also found it very interesting that the education field uses ordinal encoding, yet levels 5 and 6 are both listed as "unknown." The Kaggle docs specifically state the following:

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

We may need to do some feature manipulation to address this issue.
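One possible fix (a sketch on a hypothetical stand-in column, not the homework pipeline itself) is to collapse the undocumented and duplicated levels (0, 5, 6) into the existing "others" category (4):

```python
import pandas as pd

# Hypothetical stand-in for train_df["EDUCATION"]; levels 0, 5, and 6 are
# undocumented or duplicated "unknown" codes in the Kaggle description
edu = pd.Series([1, 2, 3, 4, 5, 6, 0], name="EDUCATION")

# Collapse the unknown levels into the documented "others" category (4)
edu_clean = edu.replace({0: 4, 5: 4, 6: 4})
print(edu_clean.tolist())  # [1, 2, 3, 4, 4, 4, 4]
```

This keeps EDUCATION a single ordinal column while avoiding three separate codes that all mean "unknown".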

In [3]:
df = pd.read_csv("UCI_Credit_Card.csv")
df.head()
Out[3]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
0 1 20000.0 2 2 1 24 2 2 -1 -1 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2 120000.0 2 2 2 26 -1 2 0 0 ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3 90000.0 2 2 2 34 0 0 0 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4 50000.0 2 2 1 37 0 0 0 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5 50000.0 1 2 1 57 -1 0 -1 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0

5 rows × 25 columns



2. Data splitting ¶


rubric={points:2}

Your tasks:

  1. Split the data into train (70%) and test (30%) portions with random_state=123.

If your computer cannot handle training on 70% of the data, make the test split bigger.

Solution_2

Points: 2

In [4]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
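Since the target turns out to be imbalanced (the EDA below shows about 22% positives), a stratified split is one option worth considering. A minimal sketch on a toy frame (the `toy` frame and its `target` column are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df; "target" plays the role of "default.payment.next.month"
toy = pd.DataFrame({"x": range(100), "target": [0] * 80 + [1] * 20})

# stratify keeps the 80/20 class ratio identical in both portions
tr, te = train_test_split(
    toy, test_size=0.3, random_state=123, stratify=toy["target"]
)
print(tr["target"].mean(), te["target"].mean())  # 0.2 0.2
```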



3. EDA ¶


rubric={points:10}

Your tasks:

  1. Perform exploratory data analysis on the train set.
  2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
  3. Summarize your initial observations about the data.
  4. Pick appropriate metric/metrics for assessment.

Solution_3

Points: 10

In [5]:
train_df.dtypes
Out[5]:
ID                              int64
LIMIT_BAL                     float64
SEX                             int64
EDUCATION                       int64
MARRIAGE                        int64
AGE                             int64
PAY_0                           int64
PAY_2                           int64
PAY_3                           int64
PAY_4                           int64
PAY_5                           int64
PAY_6                           int64
BILL_AMT1                     float64
BILL_AMT2                     float64
BILL_AMT3                     float64
BILL_AMT4                     float64
BILL_AMT5                     float64
BILL_AMT6                     float64
PAY_AMT1                      float64
PAY_AMT2                      float64
PAY_AMT3                      float64
PAY_AMT4                      float64
PAY_AMT5                      float64
PAY_AMT6                      float64
default.payment.next.month      int64
dtype: object
In [6]:
train_df["default.payment.next.month"].value_counts(normalize=True)
Out[6]:
default.payment.next.month
0    0.776762
1    0.223238
Name: proportion, dtype: float64

There is a class imbalance in the target data.
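With roughly 78% of clients not defaulting, a classifier can score near 0.78 accuracy by always predicting the majority class, which is why accuracy alone is a weak yardstick here. A minimal sketch on synthetic data (the arrays below are made up, not the homework data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic target with roughly the same ~22% positive rate as the train split
rng = np.random.default_rng(123)
y = (rng.random(1000) < 0.223).astype(int)
X = rng.normal(size=(1000, 3))  # uninformative features, just to satisfy the API

# A most-frequent baseline scores close to the majority-class rate (~0.78)
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(round(dummy.score(X, y), 2))
```

Metrics such as precision, recall, F1, and average precision are better suited to this imbalance, and `class_weight="balanced"` in LogisticRegression is one common mitigation.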

In [7]:
train_df.describe()
Out[7]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
count 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000 ... 21000.000000 21000.000000 21000.000000 21000.000000 2.100000e+04 21000.000000 21000.000000 21000.000000 21000.000000 21000.000000
mean 14962.348238 167880.651429 1.600762 1.852143 1.554000 35.500810 -0.015429 -0.137095 -0.171619 -0.225238 ... 43486.610905 40428.518333 38767.202667 5673.585143 5.895027e+03 5311.432286 4774.021381 4751.850095 5237.762190 0.223238
std 8650.734050 130202.682167 0.489753 0.792961 0.521675 9.212644 1.120465 1.194506 1.196123 1.168556 ... 64843.303993 61187.200817 59587.689549 17033.241454 2.180143e+04 18377.997079 15434.136142 15228.193125 18116.846563 0.416427
min 1.000000 10000.000000 1.000000 0.000000 0.000000 21.000000 -2.000000 -2.000000 -2.000000 -2.000000 ... -50616.000000 -61372.000000 -339603.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 7498.750000 50000.000000 1.000000 1.000000 1.000000 28.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... 2293.750000 1739.500000 1215.750000 1000.000000 8.200000e+02 390.000000 266.000000 234.000000 110.750000 0.000000
50% 14960.500000 140000.000000 2.000000 2.000000 2.000000 34.000000 0.000000 0.000000 0.000000 0.000000 ... 19102.500000 18083.000000 16854.500000 2100.000000 2.007000e+03 1809.500000 1500.000000 1500.000000 1500.000000 0.000000
75% 22458.250000 240000.000000 2.000000 2.000000 2.000000 41.000000 0.000000 0.000000 0.000000 0.000000 ... 54763.250000 50491.000000 49253.750000 5007.250000 5.000000e+03 4628.500000 4021.250000 4016.000000 4000.000000 0.000000
max 30000.000000 1000000.000000 2.000000 6.000000 3.000000 79.000000 8.000000 8.000000 8.000000 8.000000 ... 891586.000000 927171.000000 961664.000000 873552.000000 1.227082e+06 896040.000000 621000.000000 426529.000000 528666.000000 1.000000

8 rows × 25 columns

Summary Statistics

  • Sample dataset contains 21,000 credit card clients from Taiwan
  • Average credit limit is 167,880.65 TWD, average default rate in the next month is 22.32% (mean default.payment.next.month = 0.2232)
  • Demographics: the mean of SEX is 1.60, so the majority of clients are female (2); the mean of EDUCATION is 1.85, leaning towards university (2); the mean of MARRIAGE is 1.55, leaning towards single (2); and the average age is about 35.5 years
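Reading means of integer-coded categoricals (e.g., SEX = 1.60) is indirect; per-category default rates are often clearer. A sketch on a toy frame (the values below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for train_df with integer-coded SEX (1 = male, 2 = female)
toy = pd.DataFrame({
    "SEX": [1, 1, 2, 2, 2, 2, 2],
    "default.payment.next.month": [1, 0, 0, 1, 0, 0, 0],
})

# Default rate within each SEX category
rates = toy.groupby("SEX")["default.payment.next.month"].mean()
print(rates[1], rates[2])  # 0.5 0.2
```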
In [8]:
# wrangling data so PAY, BILL_AMT, and PAY_AMT columns across the 6 months are combined

# change inconsistent naming
train_df.rename(columns={"PAY_0": "PAY_1"}, inplace=True)
train_df.rename(columns={"default.payment.next.month": "default_payment_next_month"}, inplace=True)

# Convert PAY_X columns to long format
pay_df = train_df.melt(id_vars=['ID'], value_vars=['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'],
                 var_name="Month", value_name="Repayment Status")
pay_df['Month'] = pay_df['Month'].str.extract(r'(\d+)')  # Extract month number
# Convert BILL_AMTX columns to long format
bill_df = train_df.melt(id_vars=['ID'], value_vars=['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'],
                  var_name="Month", value_name="Bill Amount")
bill_df['Month'] = bill_df['Month'].str.extract(r'(\d+)')  # Extract month number
# Convert PAY_AMTX columns to long format
pay_amt_df = train_df.melt(id_vars=['ID'], value_vars=['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
                     var_name="Month", value_name="Payment Amount")
pay_amt_df['Month'] = pay_amt_df['Month'].str.extract(r'(\d+)')  # Extract month number

# Merge all three dfs on ID and Month
df_long = pay_df.merge(bill_df, on=['ID', 'Month']).merge(pay_amt_df, on=['ID', 'Month'])
# Convert Month to integer for sorting
df_long['Month'] = df_long['Month'].astype(int)
# merge df_long with remaining columns
pay_columns = ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
bill_columns = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
pay_amt_columns = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
id_vars = [col for col in train_df.columns if col not in pay_columns + bill_columns + pay_amt_columns]
df_long = df_long.merge(train_df[id_vars], on=['ID'])

month_mapping = {
    1: "Sep", 2: "Aug", 3: "Jul", 4: "Jun", 5: "May", 6: "Apr"
}
# Apply the mapping to the "Month" column
df_long["Month"] = df_long["Month"].replace(month_mapping)

# display result
df_long.head()
# confirm there are 6 rows (1 per month) per ID
df_long.query("ID == 1")
Out[8]:
ID Month Repayment Status Bill Amount Payment Amount LIMIT_BAL SEX EDUCATION MARRIAGE AGE default_payment_next_month
12426 1 Sep 2 3913.0 0.0 20000.0 2 2 1 24 1
33426 1 Aug 2 3102.0 689.0 20000.0 2 2 1 24 1
54426 1 Jul -1 689.0 0.0 20000.0 2 2 1 24 1
75426 1 Jun -1 0.0 0.0 20000.0 2 2 1 24 1
96426 1 May -2 0.0 0.0 20000.0 2 2 1 24 1
117426 1 Apr -2 0.0 0.0 20000.0 2 2 1 24 1
In [9]:
alt.data_transformers.disable_max_rows()

df_demo = df_long.drop(columns=["ID", "Month"])
df_demo_X = df_demo.drop(columns=["default_payment_next_month"])

demo = alt.Chart(df_demo).mark_bar().encode(
  alt.X(alt.repeat('row'), type='nominal'),
  alt.Y(alt.repeat('column'), aggregate='average', type='quantitative'),
  alt.Tooltip(alt.repeat('column'), aggregate='average', type='quantitative')
).properties(
  width=150,
  height=150
).repeat(
  row=["default_payment_next_month"],
  column=df_demo_X.columns
)

demo.properties(title = "Comparison Between Credit Card Clients that Did vs. Did Not Default in October")
Out[9]: